Data Mining Week 5: Unsupervised Learning

  • Principal Component Analysis
  • Linear Discriminant Analysis
  • K-Means Clustering
  • t-SNE
  • UMAP

Topic #1: Principal Component Analysis

We will run PCA on a Major League Baseball dataset from fangraphs.com.

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [1]:
# Get working directory
import os
os.getcwd()
Out[1]:
'/Users/matthewberezo'
In [2]:
import pandas as pd
import seaborn as sns
%matplotlib inline
sns.set(style='white', rc={'figure.figsize':(20,20)})
import matplotlib.pyplot as plt
import numpy as np
In [3]:
# Read in the FanGraphs MLB leaderboard csv file that is stored in path:
pd.set_option('display.max_columns', None)
mlb_df = pd.read_csv('/Users/matthewberezo/Documents/FanGraphs_Leaderboard.csv')
In [4]:
mlb_df.head()
Out[4]:
Name Team G PA HR R RBI SB BB% K% ISO BABIP AVG OBP SLG wOBA wRC+ BsR Off Def WAR playerid
0 Mike Trout Angels 121 545 42 101 98 10 18.2 % 20.2 % 0.367 0.304 0.297 0.440 0.664 0.443 185 6.6 66.0 1.8 8.4 10155
1 Cody Bellinger Dodgers 122 522 42 100 100 10 14.2 % 16.3 % 0.354 0.311 0.320 0.418 0.673 0.435 174 0.2 51.4 3.7 7.0 15998
2 Christian Yelich Brewers 112 500 41 88 89 24 12.8 % 20.6 % 0.361 0.353 0.333 0.424 0.693 0.444 174 5.0 54.6 -3.9 6.5 11477
3 Ketel Marte Diamondbacks 120 534 26 84 74 8 8.4 % 14.0 % 0.251 0.333 0.319 0.380 0.569 0.392 140 3.0 31.5 9.0 5.6 13613
4 Xander Bogaerts Red Sox 123 556 27 95 94 4 11.0 % 18.0 % 0.253 0.338 0.308 0.383 0.561 0.390 141 2.0 31.5 6.6 5.6 12161
In [5]:
mlb_df.shape
Out[5]:
(145, 22)
In [6]:
# Load libraries for PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets
In [7]:
# Standardize the feature matrix
features = StandardScaler().fit_transform(mlb_df[['HR', 'R', 'SB', 'ISO', 'BABIP', 'wRC+', 'wOBA', 'Off', 'Def']])
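
To make the standardization step concrete, here is a minimal sketch on a toy array (HR and SB for the five players shown in mlb_df.head(), not the full dataset). It checks that StandardScaler matches the z-score formula z = (x - mean) / std, applied per column with the population standard deviation. The names `toy`, `scaled` and `manual` are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in: HR and SB for the first five players shown in mlb_df.head()
toy = np.array([[42, 10], [42, 10], [41, 24], [26, 8], [27, 4]], dtype=float)

scaled = StandardScaler().fit_transform(toy)

# Same result by hand: subtract each column's mean, divide by its population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single statistic dominates the PCA just because it is measured in larger units.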
In [8]:
# Create a PCA that keeps the first 5 principal components
# (these retain about 96.5% of the variance, per explained_variance_ratio_ below)
pca = PCA(n_components=5)
In [9]:
# Conduct PCA
features_pca = pca.fit_transform(features)
In [10]:
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_pca.shape[1])
Original number of features: 9
Reduced number of features: 5
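
If the goal is to retain a target share of the variance rather than a fixed number of components, scikit-learn's PCA also accepts a fraction between 0 and 1 for n_components. The sketch below assumes a synthetic 145 x 9 matrix as a stand-in for the standardized MLB features (the real csv is not loaded here), so the number of components it selects will not match the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 145 rows, 9 correlated columns built from 3 latent factors
latent = rng.normal(size=(145, 3))
mixing = rng.normal(size=(3, 9))
X = latent @ mixing + 0.1 * rng.normal(size=(145, 9))
X_std = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep as many components as needed
# for the cumulative explained variance to exceed that fraction
pca_99 = PCA(n_components=0.99, svd_solver="full")
X_reduced = pca_99.fit_transform(X_std)

retained = pca_99.explained_variance_ratio_.sum()
print("Components kept:", pca_99.n_components_)
print("Variance retained:", round(retained, 4))
```

After fitting, n_components_ reports how many components were actually kept.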
In [11]:
print(pd.DataFrame(pca.components_))
          0         1         2         3         4         5         6  \
0  0.377352  0.365247  0.005552  0.396021  0.114932  0.426907  0.430558   
1 -0.302550  0.177540  0.593391 -0.250164  0.586142  0.036284  0.064068   
2 -0.211624 -0.178061 -0.261598 -0.163447  0.508231  0.117659  0.112540   
3  0.144799  0.116593  0.703068  0.078614 -0.295229 -0.129020 -0.137310   
4 -0.107970  0.872090 -0.271774 -0.296838 -0.070073 -0.158602 -0.128204   

          7         8  
0  0.428285 -0.055398  
1  0.121514  0.313844  
2  0.030749 -0.736620  
3 -0.038461 -0.583794  
4 -0.079565 -0.113998  
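
The columns of pca.components_ follow the order of the features passed to StandardScaler, so attaching the feature names makes the loadings easier to read. As a small sketch, the first row printed above is copied by hand into a pandas Series (the name `pc0` is just for illustration) and sorted by absolute loading.

```python
import pandas as pd

feature_names = ['HR', 'R', 'SB', 'ISO', 'BABIP', 'wRC+', 'wOBA', 'Off', 'Def']

# First row of pca.components_ as printed above (copied by hand)
pc0 = pd.Series([0.377352, 0.365247, 0.005552, 0.396021, 0.114932,
                 0.426907, 0.430558, 0.428285, -0.055398], index=feature_names)

# Rank the features by the size of their loading on the first component
top3 = pc0.abs().sort_values(ascending=False).head(3)
print(top3.index.tolist())

# Each component is a unit vector, so the squared loadings sum to about 1
unit_length = (pc0 ** 2).sum()
print(round(unit_length, 3))
```

The three largest loadings are on wOBA, Off and wRC+, which suggests the first component mostly tracks overall offensive production. On the live objects, pd.DataFrame(pca.components_, columns=feature_names) would label all five rows at once.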
In [12]:
print(pca.explained_variance_ratio_)
[0.56053658 0.17047193 0.11846723 0.08171531 0.0339612 ]
In [13]:
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
Out[13]:
Text(0, 0.5, 'cumulative explained variance')
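
As a quick arithmetic check on the ratios printed above, the cumulative sum can be tabulated directly. This sketch hard-codes the five values from Out[12]. The 90% target is an arbitrary threshold chosen for illustration.

```python
import numpy as np

# explained_variance_ratio_ as printed above
ratios = np.array([0.56053658, 0.17047193, 0.11846723, 0.08171531, 0.0339612])
cumulative = np.cumsum(ratios)

for k, total in enumerate(cumulative, start=1):
    print(f"{k} component(s): {total:.4f}")

# Smallest number of components reaching the target (None if never reached)
target = 0.90
reached = np.flatnonzero(cumulative >= target)
n_needed = int(reached[0]) + 1 if reached.size else None
print("Components needed for 90%:", n_needed)
```

The five components together retain about 96.5% of the variance, and four are enough to pass 90%. Note that the plot above uses a zero-based x-axis, so position 0 corresponds to one component.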
In [14]:
# Let's check the shape of our features_pca
features_pca.shape
Out[14]:
(145, 5)
In [15]:
features_pca_df = pd.DataFrame(features_pca)
features_pca_df.head()
Out[15]:
0 1 2 3 4
0 7.163457 0.046705 -0.990772 0.061662 -0.601606
1 6.335476 0.139616 -1.136429 -0.037313 -0.446262
2 6.477107 1.710883 -0.163068 1.538580 -1.719940
3 2.819954 1.109630 -0.727834 -0.963224 -0.108355
4 3.153717 0.810859 -0.417753 -1.145282 0.656922
In [16]:
mlb_df
Out[16]:
Name Team G PA HR R RBI SB BB% K% ISO BABIP AVG OBP SLG wOBA wRC+ BsR Off Def WAR playerid
0 Mike Trout Angels 121 545 42 101 98 10 18.2 % 20.2 % 0.367 0.304 0.297 0.440 0.664 0.443 185 6.6 66.0 1.8 8.4 10155
1 Cody Bellinger Dodgers 122 522 42 100 100 10 14.2 % 16.3 % 0.354 0.311 0.320 0.418 0.673 0.435 174 0.2 51.4 3.7 7.0 15998
2 Christian Yelich Brewers 112 500 41 88 89 24 12.8 % 20.6 % 0.361 0.353 0.333 0.424 0.693 0.444 174 5.0 54.6 -3.9 6.5 11477
3 Ketel Marte Diamondbacks 120 534 26 84 74 8 8.4 % 14.0 % 0.251 0.333 0.319 0.380 0.569 0.392 140 3.0 31.5 9.0 5.6 13613
4 Xander Bogaerts Red Sox 123 556 27 95 94 4 11.0 % 18.0 % 0.253 0.338 0.308 0.383 0.561 0.390 141 2.0 31.5 6.6 5.6 12161
5 Alex Bregman Astros 121 540 30 94 83 4 17.0 % 12.8 % 0.276 0.266 0.279 0.407 0.555 0.399 156 -2.2 36.6 1.1 5.5 17678
6 Rafael Devers Red Sox 124 550 27 103 101 8 6.9 % 16.2 % 0.262 0.356 0.329 0.377 0.592 0.396 145 1.1 33.1 3.3 5.4 17350
7 Anthony Rendon Nationals 111 488 27 89 98 3 9.8 % 14.1 % 0.286 0.327 0.322 0.400 0.608 0.410 151 0.6 34.1 3.6 5.2 12861
8 Ronald Acuna Jr. Braves 126 583 35 105 85 29 9.8 % 24.5 % 0.241 0.348 0.296 0.376 0.537 0.381 133 5.5 31.0 1.7 5.0 18401
9 Marcus Semien Athletics 126 579 21 90 59 7 11.7 % 14.3 % 0.208 0.291 0.273 0.359 0.481 0.354 124 0.8 18.6 12.0 4.9 12533
10 Matt Chapman Athletics 122 522 29 81 70 0 10.3 % 20.7 % 0.270 0.277 0.259 0.343 0.529 0.361 129 0.1 19.2 13.3 4.9 16505
11 Mookie Betts Red Sox 126 593 21 115 65 13 14.8 % 15.0 % 0.218 0.305 0.284 0.390 0.502 0.372 129 4.2 26.1 2.6 4.8 13611
12 Trevor Story Rockies 113 511 28 91 74 17 8.4 % 25.4 % 0.277 0.355 0.296 0.362 0.573 0.385 122 5.1 20.1 11.1 4.6 12564
13 J.T. Realmuto Phillies 119 473 19 75 66 7 6.8 % 22.6 % 0.203 0.324 0.277 0.328 0.480 0.336 105 5.4 8.8 23.3 4.6 11739
14 DJ LeMahieu Yankees 113 508 21 87 86 4 7.1 % 13.6 % 0.202 0.359 0.338 0.385 0.540 0.386 142 -0.2 27.0 2.7 4.6 9874
15 Kris Bryant Cubs 119 522 25 89 61 2 11.7 % 21.3 % 0.243 0.329 0.286 0.385 0.529 0.383 136 3.2 28.3 0.7 4.5 15429
16 George Springer Astros 93 434 27 73 68 5 11.5 % 19.8 % 0.277 0.311 0.293 0.378 0.569 0.390 150 1.1 28.8 1.7 4.4 12856
17 Michael Brantley Astros 118 517 18 76 76 3 8.1 % 10.1 % 0.206 0.345 0.334 0.393 0.540 0.388 149 0.7 33.0 -6.0 4.4 4106
18 Javier Baez Cubs 122 517 28 84 81 10 4.4 % 27.5 % 0.256 0.346 0.285 0.315 0.541 0.349 114 2.9 12.6 14.4 4.2 12979
19 Peter Alonso Mets 124 530 40 76 97 1 11.1 % 25.3 % 0.333 0.293 0.271 0.374 0.603 0.398 151 -0.9 35.3 -8.8 4.2 19251
20 Carlos Santana Indians 123 535 29 89 78 4 17.0 % 15.0 % 0.253 0.296 0.290 0.411 0.543 0.395 145 0.5 31.7 -7.4 4.2 2396
21 Max Muncy Dodgers 121 503 32 85 86 3 14.9 % 23.7 % 0.276 0.282 0.260 0.374 0.536 0.378 136 2.0 26.4 -1.5 4.0 13301
22 Yoan Moncada White Sox 97 409 20 58 59 7 7.6 % 27.6 % 0.234 0.382 0.301 0.358 0.535 0.370 134 4.0 21.8 5.1 4.0 17232
23 Yasmani Grandal Brewers 118 479 20 55 61 5 15.7 % 20.9 % 0.213 0.287 0.254 0.376 0.467 0.357 118 -3.6 7.7 17.7 4.0 11368
24 Max Kepler Twins 116 526 34 87 84 1 10.1 % 16.0 % 0.280 0.243 0.256 0.337 0.537 0.360 123 -2.0 13.8 8.5 3.9 12144
25 Francisco Lindor Indians 107 491 21 73 53 18 7.5 % 14.7 % 0.220 0.314 0.298 0.353 0.518 0.358 120 0.0 12.8 10.1 3.9 12916
26 Freddie Freeman Braves 126 562 33 99 102 5 12.6 % 18.1 % 0.270 0.330 0.307 0.399 0.576 0.401 146 0.0 34.6 -13.2 3.8 5361
27 Josh Donaldson Braves 123 521 29 73 73 3 14.4 % 24.0 % 0.265 0.309 0.267 0.382 0.532 0.381 133 -3.1 20.0 1.7 3.8 5038
28 Juan Soto Nationals 114 504 28 79 83 12 15.5 % 19.8 % 0.267 0.318 0.290 0.401 0.557 0.397 143 -0.4 28.5 -6.6 3.7 20123
29 Jeff McNeil Mets 105 442 15 67 55 4 6.3 % 14.0 % 0.196 0.366 0.332 0.400 0.529 0.391 147 -2.4 25.3 -1.7 3.7 15362
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
115 Todd Frazier Mets 101 389 16 45 50 1 7.2 % 22.9 % 0.184 0.266 0.232 0.303 0.416 0.306 92 -0.7 -5.1 1.0 0.8 785
116 Jose Abreu White Sox 124 536 28 63 94 2 4.9 % 22.2 % 0.223 0.303 0.273 0.313 0.496 0.333 109 -5.5 0.4 -10.9 0.8 15676
117 Wilson Ramos Mets 110 410 13 43 61 0 9.3 % 12.9 % 0.133 0.295 0.279 0.346 0.412 0.322 102 -4.8 -3.9 -1.4 0.8 1433
118 Alex Gordon Royals 119 509 12 66 63 5 7.7 % 15.9 % 0.144 0.289 0.260 0.333 0.404 0.314 93 -1.8 -6.4 -4.4 0.7 5209
119 Yolmer Sanchez White Sox 115 419 2 45 30 5 7.2 % 22.0 % 0.071 0.326 0.251 0.312 0.323 0.278 72 1.8 -13.5 5.5 0.7 11602
120 Eric Hosmer Padres 124 527 18 63 82 0 6.5 % 20.9 % 0.164 0.333 0.285 0.332 0.450 0.328 103 -3.7 -1.9 -9.5 0.6 3516
121 Renato Nunez Orioles 120 484 27 62 72 0 7.4 % 24.0 % 0.232 0.262 0.239 0.306 0.472 0.324 100 0.5 0.8 -12.0 0.6 14503
122 Justin Smoak Blue Jays 102 424 19 48 52 0 16.5 % 20.8 % 0.201 0.227 0.213 0.351 0.414 0.332 106 -3.5 0.0 -9.2 0.6 9054
123 Joey Votto Reds 113 486 12 60 39 4 11.7 % 20.6 % 0.149 0.315 0.262 0.352 0.410 0.328 98 -1.9 -3.3 -7.0 0.6 4314
124 Nomar Mazara Rangers 109 443 17 65 62 4 5.9 % 22.6 % 0.197 0.313 0.268 0.318 0.466 0.326 95 -0.3 -3.3 -7.6 0.5 14553
125 Matt Carpenter Cardinals 96 397 11 47 33 6 12.8 % 25.7 % 0.152 0.269 0.214 0.323 0.366 0.301 84 0.1 -8.2 -0.5 0.4 8090
126 Colin Moran Pirates 119 391 11 37 64 0 6.4 % 23.5 % 0.155 0.349 0.283 0.330 0.438 0.321 97 -1.0 -2.4 -6.7 0.4 16909
127 Domingo Santana Mariners 115 496 21 63 69 8 9.9 % 32.1 % 0.193 0.350 0.256 0.332 0.449 0.330 110 0.8 7.1 -21.6 0.3 10348
128 Mallex Smith Mariners 105 459 5 56 31 34 7.0 % 25.9 % 0.115 0.316 0.235 0.301 0.350 0.283 78 7.3 -5.7 -8.1 0.2 13608
129 Adam Jones Diamondbacks 115 449 14 61 57 2 5.1 % 18.3 % 0.161 0.304 0.270 0.317 0.431 0.315 90 -1.7 -8.0 -4.7 0.2 6368
130 Franmil Reyes - - - 118 424 29 48 53 0 7.8 % 28.1 % 0.255 0.259 0.239 0.295 0.494 0.323 99 -3.4 -3.9 -8.7 0.1 14566
131 C.J. Cron Twins 99 411 20 42 66 0 6.1 % 20.0 % 0.216 0.280 0.259 0.319 0.475 0.330 103 -5.0 -3.5 -9.6 0.1 12546
132 Nick Markakis Braves 104 416 9 57 55 2 10.3 % 12.7 % 0.145 0.307 0.284 0.358 0.429 0.336 103 -1.8 0.1 -12.7 0.1 5930
133 Josh Reddick Astros 113 449 10 45 37 4 6.2 % 11.4 % 0.117 0.271 0.260 0.302 0.377 0.288 81 0.3 -10.8 -4.4 0.1 3892
134 Jurickson Profar Athletics 107 398 15 45 51 7 6.8 % 15.3 % 0.176 0.204 0.204 0.269 0.380 0.274 69 1.4 -14.3 0.1 0.0 10815
135 Rougned Odor Rangers 111 450 21 58 67 8 8.9 % 31.6 % 0.215 0.252 0.203 0.281 0.418 0.293 73 -0.1 -16.0 -0.3 0.0 12282
136 Starlin Castro Marlins 124 518 11 41 58 2 3.5 % 16.6 % 0.118 0.294 0.262 0.286 0.380 0.280 74 -2.4 -20.7 2.4 -0.1 4579
137 Orlando Arcia Brewers 119 436 14 47 46 7 8.5 % 20.2 % 0.140 0.257 0.228 0.292 0.368 0.278 66 1.5 -18.2 2.6 -0.1 13185
138 Brandon Belt Giants 121 489 14 62 48 3 14.3 % 20.2 % 0.160 0.264 0.228 0.342 0.387 0.316 95 -2.8 -6.1 -11.7 -0.2 10264
139 Albert Pujols Angels 100 414 19 43 73 1 7.5 % 13.0 % 0.193 0.238 0.247 0.304 0.440 0.310 95 -3.4 -5.9 -10.5 -0.2 1177
140 Joe Panik - - - 113 421 3 42 30 4 9.0 % 9.5 % 0.083 0.262 0.243 0.314 0.325 0.280 71 -2.7 -18.7 2.9 -0.2 11936
141 Brandon Crawford Giants 115 443 9 46 49 3 8.8 % 22.3 % 0.123 0.273 0.224 0.296 0.347 0.276 69 -0.7 -19.0 2.2 -0.2 5343
142 Miguel Cabrera Tigers 111 451 9 32 50 0 8.2 % 20.2 % 0.111 0.341 0.283 0.344 0.393 0.316 95 -5.3 -8.2 -11.1 -0.3 1744
143 Ian Desmond Rockies 111 393 14 53 56 3 7.4 % 24.7 % 0.209 0.316 0.259 0.316 0.468 0.327 84 -0.8 -9.0 -9.0 -0.5 6885
144 Khris Davis Athletics 104 417 17 48 53 0 8.4 % 26.6 % 0.156 0.263 0.220 0.290 0.376 0.283 76 -1.4 -14.3 -10.4 -1.0 9112

145 rows × 22 columns

In [17]:
features_pca_df
Out[17]:
0 1 2 3 4
0 7.163457 0.046705 -0.990772 0.061662 -0.601606
1 6.335476 0.139616 -1.136429 -0.037313 -0.446262
2 6.477107 1.710883 -0.163068 1.538580 -1.719940
3 2.819954 1.109630 -0.727834 -0.963224 -0.108355
4 3.153717 0.810859 -0.417753 -1.145282 0.656922
5 3.767822 -0.775777 -0.956835 -0.166191 0.506293
6 3.682109 1.418552 -0.062899 -0.600222 0.833999
7 3.745332 0.272736 -0.155388 -1.068663 0.107924
8 3.468275 2.866516 -1.135644 2.082662 0.277447
9 1.124349 0.666665 -1.646959 -0.767478 0.910599
10 1.848976 -0.781189 -1.931742 -1.392317 0.226418
11 2.416591 1.389762 -0.974334 0.549071 1.940530
12 2.564253 2.134466 -1.306906 -0.073377 -0.123774
13 -0.177080 1.477495 -2.202597 -1.861905 0.163146
14 2.284652 1.316423 0.638589 -1.235716 0.614196
15 2.595826 0.237871 0.190829 -0.855661 0.675992
16 2.860431 -0.116302 -0.163799 -0.602959 -0.626130
17 2.271833 0.662077 1.556047 -0.743958 0.145621
18 1.383210 1.364054 -1.517991 -0.866454 0.092600
19 4.220781 -1.826725 0.269665 0.199999 -0.636297
20 3.212805 -0.600136 0.374346 0.235698 0.552520
21 2.793340 -1.026621 -0.574035 -0.016842 0.311607
22 1.365975 1.564103 0.775553 -1.357833 -1.179505
23 -0.124919 0.188834 -1.823291 -1.598099 -0.911156
24 1.922781 -1.678798 -2.273136 -0.431478 0.609988
25 0.730232 1.679528 -1.425933 0.267508 -0.502998
26 4.071394 -0.245033 1.166645 0.624891 0.887608
27 2.173770 -0.463921 -0.265444 -0.625677 -0.303284
28 3.008465 0.362747 0.388851 0.779027 -0.413701
29 1.682210 1.278092 1.596625 -1.258724 -0.357079
... ... ... ... ... ...
115 -2.179427 -1.294600 -0.389104 -0.359911 -0.353812
116 0.066757 -1.350860 0.841830 0.516561 0.079274
117 -2.180214 -0.624994 0.661046 -0.790728 -0.274275
118 -1.959609 -0.209160 0.325497 0.277754 0.888214
119 -4.394372 1.118573 0.364613 -0.942747 0.353546
120 -0.947702 -0.381256 1.596174 -0.253997 0.590249
121 -0.338802 -2.314139 0.359477 0.812588 0.266936
122 -1.179340 -2.504501 0.075257 0.550211 -0.267248
123 -1.598834 0.005374 1.129519 0.019851 0.459754
124 -1.008948 -0.398379 0.816302 0.297072 0.457133
125 -2.858626 -0.560391 -0.266072 0.219321 -0.111412
126 -2.098375 -0.017265 1.995653 -0.864019 -0.689265
127 -0.045333 -0.021842 2.603459 1.401998 0.113289
128 -3.213481 2.894274 0.125258 3.391718 -0.388906
129 -1.904938 -0.454841 0.634460 -0.107767 0.624990
130 -0.606216 -2.598737 0.016295 0.571272 -0.647851
131 -1.168313 -1.827534 0.862256 0.156626 -0.739242
132 -1.525076 -0.402454 1.822498 0.164773 0.457300
133 -3.453756 -0.748603 0.305243 0.283996 0.196160
134 -3.571114 -1.943860 -1.646102 1.131414 -0.077629
135 -2.305392 -1.272602 -1.269329 1.016086 0.125058
136 -3.981979 -0.448362 -0.002354 -0.560630 0.025723
137 -3.781950 -0.753516 -1.012397 0.445206 0.114493
138 -1.809407 -1.291572 0.709510 0.846215 0.780640
139 -1.953031 -2.408003 0.260729 0.780584 -0.350706
140 -4.670691 -0.326142 -0.320982 -0.279760 0.351752
141 -4.086284 -0.619562 -0.386633 -0.210480 0.342226
142 -2.869809 -0.182913 2.498344 -0.543584 -0.559063
143 -1.670702 -0.634915 1.143966 0.223645 -0.059816
144 -3.041683 -1.927617 0.555723 0.640937 0.407794

145 rows × 5 columns

In [18]:
# Let's plot principal components with player names labeled:
features_pca_df = features_pca_df.add_prefix('pca')
In [19]:
mlb_df_pca = pd.concat([mlb_df, features_pca_df], axis = 1)
In [20]:
mlb_df_pca
Out[20]:
Name Team G PA HR R RBI SB BB% K% ISO BABIP AVG OBP SLG wOBA wRC+ BsR Off Def WAR playerid pca0 pca1 pca2 pca3 pca4
0 Mike Trout Angels 121 545 42 101 98 10 18.2 % 20.2 % 0.367 0.304 0.297 0.440 0.664 0.443 185 6.6 66.0 1.8 8.4 10155 7.163457 0.046705 -0.990772 0.061662 -0.601606
1 Cody Bellinger Dodgers 122 522 42 100 100 10 14.2 % 16.3 % 0.354 0.311 0.320 0.418 0.673 0.435 174 0.2 51.4 3.7 7.0 15998 6.335476 0.139616 -1.136429 -0.037313 -0.446262
2 Christian Yelich Brewers 112 500 41 88 89 24 12.8 % 20.6 % 0.361 0.353 0.333 0.424 0.693 0.444 174 5.0 54.6 -3.9 6.5 11477 6.477107 1.710883 -0.163068 1.538580 -1.719940
3 Ketel Marte Diamondbacks 120 534 26 84 74 8 8.4 % 14.0 % 0.251 0.333 0.319 0.380 0.569 0.392 140 3.0 31.5 9.0 5.6 13613 2.819954 1.109630 -0.727834 -0.963224 -0.108355
4 Xander Bogaerts Red Sox 123 556 27 95 94 4 11.0 % 18.0 % 0.253 0.338 0.308 0.383 0.561 0.390 141 2.0 31.5 6.6 5.6 12161 3.153717 0.810859 -0.417753 -1.145282 0.656922
5 Alex Bregman Astros 121 540 30 94 83 4 17.0 % 12.8 % 0.276 0.266 0.279 0.407 0.555 0.399 156 -2.2 36.6 1.1 5.5 17678 3.767822 -0.775777 -0.956835 -0.166191 0.506293
6 Rafael Devers Red Sox 124 550 27 103 101 8 6.9 % 16.2 % 0.262 0.356 0.329 0.377 0.592 0.396 145 1.1 33.1 3.3 5.4 17350 3.682109 1.418552 -0.062899 -0.600222 0.833999
7 Anthony Rendon Nationals 111 488 27 89 98 3 9.8 % 14.1 % 0.286 0.327 0.322 0.400 0.608 0.410 151 0.6 34.1 3.6 5.2 12861 3.745332 0.272736 -0.155388 -1.068663 0.107924
8 Ronald Acuna Jr. Braves 126 583 35 105 85 29 9.8 % 24.5 % 0.241 0.348 0.296 0.376 0.537 0.381 133 5.5 31.0 1.7 5.0 18401 3.468275 2.866516 -1.135644 2.082662 0.277447
9 Marcus Semien Athletics 126 579 21 90 59 7 11.7 % 14.3 % 0.208 0.291 0.273 0.359 0.481 0.354 124 0.8 18.6 12.0 4.9 12533 1.124349 0.666665 -1.646959 -0.767478 0.910599
10 Matt Chapman Athletics 122 522 29 81 70 0 10.3 % 20.7 % 0.270 0.277 0.259 0.343 0.529 0.361 129 0.1 19.2 13.3 4.9 16505 1.848976 -0.781189 -1.931742 -1.392317 0.226418
11 Mookie Betts Red Sox 126 593 21 115 65 13 14.8 % 15.0 % 0.218 0.305 0.284 0.390 0.502 0.372 129 4.2 26.1 2.6 4.8 13611 2.416591 1.389762 -0.974334 0.549071 1.940530
12 Trevor Story Rockies 113 511 28 91 74 17 8.4 % 25.4 % 0.277 0.355 0.296 0.362 0.573 0.385 122 5.1 20.1 11.1 4.6 12564 2.564253 2.134466 -1.306906 -0.073377 -0.123774
13 J.T. Realmuto Phillies 119 473 19 75 66 7 6.8 % 22.6 % 0.203 0.324 0.277 0.328 0.480 0.336 105 5.4 8.8 23.3 4.6 11739 -0.177080 1.477495 -2.202597 -1.861905 0.163146
14 DJ LeMahieu Yankees 113 508 21 87 86 4 7.1 % 13.6 % 0.202 0.359 0.338 0.385 0.540 0.386 142 -0.2 27.0 2.7 4.6 9874 2.284652 1.316423 0.638589 -1.235716 0.614196
15 Kris Bryant Cubs 119 522 25 89 61 2 11.7 % 21.3 % 0.243 0.329 0.286 0.385 0.529 0.383 136 3.2 28.3 0.7 4.5 15429 2.595826 0.237871 0.190829 -0.855661 0.675992
16 George Springer Astros 93 434 27 73 68 5 11.5 % 19.8 % 0.277 0.311 0.293 0.378 0.569 0.390 150 1.1 28.8 1.7 4.4 12856 2.860431 -0.116302 -0.163799 -0.602959 -0.626130
17 Michael Brantley Astros 118 517 18 76 76 3 8.1 % 10.1 % 0.206 0.345 0.334 0.393 0.540 0.388 149 0.7 33.0 -6.0 4.4 4106 2.271833 0.662077 1.556047 -0.743958 0.145621
18 Javier Baez Cubs 122 517 28 84 81 10 4.4 % 27.5 % 0.256 0.346 0.285 0.315 0.541 0.349 114 2.9 12.6 14.4 4.2 12979 1.383210 1.364054 -1.517991 -0.866454 0.092600
19 Peter Alonso Mets 124 530 40 76 97 1 11.1 % 25.3 % 0.333 0.293 0.271 0.374 0.603 0.398 151 -0.9 35.3 -8.8 4.2 19251 4.220781 -1.826725 0.269665 0.199999 -0.636297
20 Carlos Santana Indians 123 535 29 89 78 4 17.0 % 15.0 % 0.253 0.296 0.290 0.411 0.543 0.395 145 0.5 31.7 -7.4 4.2 2396 3.212805 -0.600136 0.374346 0.235698 0.552520
21 Max Muncy Dodgers 121 503 32 85 86 3 14.9 % 23.7 % 0.276 0.282 0.260 0.374 0.536 0.378 136 2.0 26.4 -1.5 4.0 13301 2.793340 -1.026621 -0.574035 -0.016842 0.311607
22 Yoan Moncada White Sox 97 409 20 58 59 7 7.6 % 27.6 % 0.234 0.382 0.301 0.358 0.535 0.370 134 4.0 21.8 5.1 4.0 17232 1.365975 1.564103 0.775553 -1.357833 -1.179505
23 Yasmani Grandal Brewers 118 479 20 55 61 5 15.7 % 20.9 % 0.213 0.287 0.254 0.376 0.467 0.357 118 -3.6 7.7 17.7 4.0 11368 -0.124919 0.188834 -1.823291 -1.598099 -0.911156
24 Max Kepler Twins 116 526 34 87 84 1 10.1 % 16.0 % 0.280 0.243 0.256 0.337 0.537 0.360 123 -2.0 13.8 8.5 3.9 12144 1.922781 -1.678798 -2.273136 -0.431478 0.609988
25 Francisco Lindor Indians 107 491 21 73 53 18 7.5 % 14.7 % 0.220 0.314 0.298 0.353 0.518 0.358 120 0.0 12.8 10.1 3.9 12916 0.730232 1.679528 -1.425933 0.267508 -0.502998
26 Freddie Freeman Braves 126 562 33 99 102 5 12.6 % 18.1 % 0.270 0.330 0.307 0.399 0.576 0.401 146 0.0 34.6 -13.2 3.8 5361 4.071394 -0.245033 1.166645 0.624891 0.887608
27 Josh Donaldson Braves 123 521 29 73 73 3 14.4 % 24.0 % 0.265 0.309 0.267 0.382 0.532 0.381 133 -3.1 20.0 1.7 3.8 5038 2.173770 -0.463921 -0.265444 -0.625677 -0.303284
28 Juan Soto Nationals 114 504 28 79 83 12 15.5 % 19.8 % 0.267 0.318 0.290 0.401 0.557 0.397 143 -0.4 28.5 -6.6 3.7 20123 3.008465 0.362747 0.388851 0.779027 -0.413701
29 Jeff McNeil Mets 105 442 15 67 55 4 6.3 % 14.0 % 0.196 0.366 0.332 0.400 0.529 0.391 147 -2.4 25.3 -1.7 3.7 15362 1.682210 1.278092 1.596625 -1.258724 -0.357079
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
115 Todd Frazier Mets 101 389 16 45 50 1 7.2 % 22.9 % 0.184 0.266 0.232 0.303 0.416 0.306 92 -0.7 -5.1 1.0 0.8 785 -2.179427 -1.294600 -0.389104 -0.359911 -0.353812
116 Jose Abreu White Sox 124 536 28 63 94 2 4.9 % 22.2 % 0.223 0.303 0.273 0.313 0.496 0.333 109 -5.5 0.4 -10.9 0.8 15676 0.066757 -1.350860 0.841830 0.516561 0.079274
117 Wilson Ramos Mets 110 410 13 43 61 0 9.3 % 12.9 % 0.133 0.295 0.279 0.346 0.412 0.322 102 -4.8 -3.9 -1.4 0.8 1433 -2.180214 -0.624994 0.661046 -0.790728 -0.274275
118 Alex Gordon Royals 119 509 12 66 63 5 7.7 % 15.9 % 0.144 0.289 0.260 0.333 0.404 0.314 93 -1.8 -6.4 -4.4 0.7 5209 -1.959609 -0.209160 0.325497 0.277754 0.888214
119 Yolmer Sanchez White Sox 115 419 2 45 30 5 7.2 % 22.0 % 0.071 0.326 0.251 0.312 0.323 0.278 72 1.8 -13.5 5.5 0.7 11602 -4.394372 1.118573 0.364613 -0.942747 0.353546
120 Eric Hosmer Padres 124 527 18 63 82 0 6.5 % 20.9 % 0.164 0.333 0.285 0.332 0.450 0.328 103 -3.7 -1.9 -9.5 0.6 3516 -0.947702 -0.381256 1.596174 -0.253997 0.590249
121 Renato Nunez Orioles 120 484 27 62 72 0 7.4 % 24.0 % 0.232 0.262 0.239 0.306 0.472 0.324 100 0.5 0.8 -12.0 0.6 14503 -0.338802 -2.314139 0.359477 0.812588 0.266936
122 Justin Smoak Blue Jays 102 424 19 48 52 0 16.5 % 20.8 % 0.201 0.227 0.213 0.351 0.414 0.332 106 -3.5 0.0 -9.2 0.6 9054 -1.179340 -2.504501 0.075257 0.550211 -0.267248
123 Joey Votto Reds 113 486 12 60 39 4 11.7 % 20.6 % 0.149 0.315 0.262 0.352 0.410 0.328 98 -1.9 -3.3 -7.0 0.6 4314 -1.598834 0.005374 1.129519 0.019851 0.459754
124 Nomar Mazara Rangers 109 443 17 65 62 4 5.9 % 22.6 % 0.197 0.313 0.268 0.318 0.466 0.326 95 -0.3 -3.3 -7.6 0.5 14553 -1.008948 -0.398379 0.816302 0.297072 0.457133
125 Matt Carpenter Cardinals 96 397 11 47 33 6 12.8 % 25.7 % 0.152 0.269 0.214 0.323 0.366 0.301 84 0.1 -8.2 -0.5 0.4 8090 -2.858626 -0.560391 -0.266072 0.219321 -0.111412
126 Colin Moran Pirates 119 391 11 37 64 0 6.4 % 23.5 % 0.155 0.349 0.283 0.330 0.438 0.321 97 -1.0 -2.4 -6.7 0.4 16909 -2.098375 -0.017265 1.995653 -0.864019 -0.689265
127 Domingo Santana Mariners 115 496 21 63 69 8 9.9 % 32.1 % 0.193 0.350 0.256 0.332 0.449 0.330 110 0.8 7.1 -21.6 0.3 10348 -0.045333 -0.021842 2.603459 1.401998 0.113289
128 Mallex Smith Mariners 105 459 5 56 31 34 7.0 % 25.9 % 0.115 0.316 0.235 0.301 0.350 0.283 78 7.3 -5.7 -8.1 0.2 13608 -3.213481 2.894274 0.125258 3.391718 -0.388906
129 Adam Jones Diamondbacks 115 449 14 61 57 2 5.1 % 18.3 % 0.161 0.304 0.270 0.317 0.431 0.315 90 -1.7 -8.0 -4.7 0.2 6368 -1.904938 -0.454841 0.634460 -0.107767 0.624990
130 Franmil Reyes - - - 118 424 29 48 53 0 7.8 % 28.1 % 0.255 0.259 0.239 0.295 0.494 0.323 99 -3.4 -3.9 -8.7 0.1 14566 -0.606216 -2.598737 0.016295 0.571272 -0.647851
131 C.J. Cron Twins 99 411 20 42 66 0 6.1 % 20.0 % 0.216 0.280 0.259 0.319 0.475 0.330 103 -5.0 -3.5 -9.6 0.1 12546 -1.168313 -1.827534 0.862256 0.156626 -0.739242
132 Nick Markakis Braves 104 416 9 57 55 2 10.3 % 12.7 % 0.145 0.307 0.284 0.358 0.429 0.336 103 -1.8 0.1 -12.7 0.1 5930 -1.525076 -0.402454 1.822498 0.164773 0.457300
133 Josh Reddick Astros 113 449 10 45 37 4 6.2 % 11.4 % 0.117 0.271 0.260 0.302 0.377 0.288 81 0.3 -10.8 -4.4 0.1 3892 -3.453756 -0.748603 0.305243 0.283996 0.196160
134 Jurickson Profar Athletics 107 398 15 45 51 7 6.8 % 15.3 % 0.176 0.204 0.204 0.269 0.380 0.274 69 1.4 -14.3 0.1 0.0 10815 -3.571114 -1.943860 -1.646102 1.131414 -0.077629
135 Rougned Odor Rangers 111 450 21 58 67 8 8.9 % 31.6 % 0.215 0.252 0.203 0.281 0.418 0.293 73 -0.1 -16.0 -0.3 0.0 12282 -2.305392 -1.272602 -1.269329 1.016086 0.125058
136 Starlin Castro Marlins 124 518 11 41 58 2 3.5 % 16.6 % 0.118 0.294 0.262 0.286 0.380 0.280 74 -2.4 -20.7 2.4 -0.1 4579 -3.981979 -0.448362 -0.002354 -0.560630 0.025723
137 Orlando Arcia Brewers 119 436 14 47 46 7 8.5 % 20.2 % 0.140 0.257 0.228 0.292 0.368 0.278 66 1.5 -18.2 2.6 -0.1 13185 -3.781950 -0.753516 -1.012397 0.445206 0.114493
138 Brandon Belt Giants 121 489 14 62 48 3 14.3 % 20.2 % 0.160 0.264 0.228 0.342 0.387 0.316 95 -2.8 -6.1 -11.7 -0.2 10264 -1.809407 -1.291572 0.709510 0.846215 0.780640
139 Albert Pujols Angels 100 414 19 43 73 1 7.5 % 13.0 % 0.193 0.238 0.247 0.304 0.440 0.310 95 -3.4 -5.9 -10.5 -0.2 1177 -1.953031 -2.408003 0.260729 0.780584 -0.350706
140 Joe Panik - - - 113 421 3 42 30 4 9.0 % 9.5 % 0.083 0.262 0.243 0.314 0.325 0.280 71 -2.7 -18.7 2.9 -0.2 11936 -4.670691 -0.326142 -0.320982 -0.279760 0.351752
141 Brandon Crawford Giants 115 443 9 46 49 3 8.8 % 22.3 % 0.123 0.273 0.224 0.296 0.347 0.276 69 -0.7 -19.0 2.2 -0.2 5343 -4.086284 -0.619562 -0.386633 -0.210480 0.342226
142 Miguel Cabrera Tigers 111 451 9 32 50 0 8.2 % 20.2 % 0.111 0.341 0.283 0.344 0.393 0.316 95 -5.3 -8.2 -11.1 -0.3 1744 -2.869809 -0.182913 2.498344 -0.543584 -0.559063
143 Ian Desmond Rockies 111 393 14 53 56 3 7.4 % 24.7 % 0.209 0.316 0.259 0.316 0.468 0.327 84 -0.8 -9.0 -9.0 -0.5 6885 -1.670702 -0.634915 1.143966 0.223645 -0.059816
144 Khris Davis Athletics 104 417 17 48 53 0 8.4 % 26.6 % 0.156 0.263 0.220 0.290 0.376 0.283 76 -1.4 -14.3 -10.4 -1.0 9112 -3.041683 -1.927617 0.555723 0.640937 0.407794

145 rows × 27 columns

In [21]:
# Marker sizes must be non-negative, so clip negative WAR to 0 before cubing
plt.scatter(mlb_df_pca.pca0, mlb_df_pca.pca1, s=mlb_df.WAR.clip(lower=0)**3)
Out[21]:
<matplotlib.collections.PathCollection at 0x1a19438860>
In [22]:
# We can also do this in seaborn and label our points
plot_w_text = sns.regplot(data = mlb_df_pca, x = 'pca0', y = 'pca1', color = 'red', fit_reg = False)
# Write each player's name just to the right of their point
for line in range(0, mlb_df_pca.shape[0]):
    plot_w_text.text(mlb_df_pca.pca0[line]+0.2, mlb_df_pca.pca1[line], mlb_df_pca.Name[line], horizontalalignment='left', size='medium', color='black', weight='semibold')

Linear discriminant analysis for dimensionality reduction

https://scikit-learn.org/stable/modules/lda_qda.html

Again, LDA can be used for supervised dimensionality reduction by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes.

We will again use our wine dataset for LDA:

In [23]:
wine_df = pd.read_csv('/Users/matthewberezo/Documents/wineQualityReds.csv')
wine_df = wine_df.drop(['Unnamed: 0'], axis=1)
In [24]:
wine_df.head()
Out[24]:
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [25]:
wine_df.shape
Out[25]:
(1599, 12)
In [26]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
In [27]:
# Create an LDA that keeps 2 components (it is fit in a later cell)
lda = LinearDiscriminantAnalysis(n_components=2)
In [28]:
wine_df['quality'].unique()
Out[28]:
array([5, 6, 7, 4, 8, 3])
In [29]:
x_lda = lda.fit(wine_df[wine_df.columns[0:11]], wine_df['quality']).transform(wine_df[wine_df.columns[0:11]])
In [30]:
# Print the number of features
print('Original number of features:', wine_df[wine_df.columns[0:11]].shape[1])
print('Reduced number of features:', x_lda.shape[1])
Original number of features: 11
Reduced number of features: 2
In [31]:
# Create array of explained variance ratios
lda_var_ratios = lda.explained_variance_ratio_
In [32]:
lda_var_ratios
Out[32]:
array([0.84961759, 0.10277776])
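
The two ratios above sum to about 0.952, so the first two discriminants carry most of the between-class separation. LDA can produce at most min(n_classes - 1, n_features) components, which would be 5 for the six quality levels here. The sketch below illustrates that cap on scikit-learn's bundled wine data (an assumption: a different dataset with 13 features and 3 classes, not wineQualityReds.csv).

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: scikit-learn's bundled wine data (178 rows, 13 features, 3 classes),
# NOT the wineQualityReds.csv file used in this notebook
X, y = load_wine(return_X_y=True)
n_classes = len(set(y))
max_components = min(n_classes - 1, X.shape[1])

lda_demo = LinearDiscriminantAnalysis(n_components=max_components)
X_lda = lda_demo.fit_transform(X, y)
print(X_lda.shape, lda_demo.explained_variance_ratio_)

# Asking for more components than the cap is rejected at fit time
try:
    LinearDiscriminantAnalysis(n_components=max_components + 1).fit(X, y)
    raised = False
except ValueError:
    raised = True
print("Too many components rejected:", raised)
```

With three classes there are only two discriminant directions, so the two ratios account for essentially all of the between-class variance.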
In [33]:
lda_df = pd.DataFrame(x_lda)
lda_df.shape
Out[33]:
(1599, 2)
In [34]:
lda_df = lda_df.add_prefix('lda')
lda_df.head()
Out[34]:
lda0 lda1
0 -1.513044 -0.530957
1 -1.281523 -0.405686
2 -1.118752 -0.135363
3 0.025156 0.972790
4 -1.513044 -0.530957
In [35]:
wine_df_lda = pd.concat([wine_df, lda_df], axis = 1)
wine_df_lda.head()
Out[35]:
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality lda0 lda1
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 -1.513044 -0.530957
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5 -1.281523 -0.405686
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5 -1.118752 -0.135363
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6 0.025156 0.972790
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 -1.513044 -0.530957
In [36]:
lda_plot = sns.lmplot(data = wine_df_lda, x = 'lda0', y = 'lda1', hue = 'quality', fit_reg = False)

K-Means Clustering

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

K-means clustering is an unsupervised approach for partitioning a dataset into K distinct, non-overlapping clusters (i.e., "grouping" data).
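
As a minimal sketch of what the fitted object exposes, the example below clusters synthetic blobs (an assumption, standing in for real data) and verifies that inertia_ is the sum of squared distances from each point to its assigned cluster center. This is the SSE quantity that k-means tries to minimize.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 150 points drawn around 3 centers
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster index assigned to each row
centers = km.cluster_centers_  # one row per cluster

# Recompute the SSE by hand from the labels and centers
manual_sse = ((X - centers[labels]) ** 2).sum()
print(np.isclose(manual_sse, km.inertia_, rtol=1e-4))
```

Each point belongs to exactly one cluster, which is what "non-overlapping" means in the definition above.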

In [37]:
# We will reuse the MLB data for this exercise
mlb_df.head()
Out[37]:
Name Team G PA HR R RBI SB BB% K% ISO BABIP AVG OBP SLG wOBA wRC+ BsR Off Def WAR playerid
0 Mike Trout Angels 121 545 42 101 98 10 18.2 % 20.2 % 0.367 0.304 0.297 0.440 0.664 0.443 185 6.6 66.0 1.8 8.4 10155
1 Cody Bellinger Dodgers 122 522 42 100 100 10 14.2 % 16.3 % 0.354 0.311 0.320 0.418 0.673 0.435 174 0.2 51.4 3.7 7.0 15998
2 Christian Yelich Brewers 112 500 41 88 89 24 12.8 % 20.6 % 0.361 0.353 0.333 0.424 0.693 0.444 174 5.0 54.6 -3.9 6.5 11477
3 Ketel Marte Diamondbacks 120 534 26 84 74 8 8.4 % 14.0 % 0.251 0.333 0.319 0.380 0.569 0.392 140 3.0 31.5 9.0 5.6 13613
4 Xander Bogaerts Red Sox 123 556 27 95 94 4 11.0 % 18.0 % 0.253 0.338 0.308 0.383 0.561 0.390 141 2.0 31.5 6.6 5.6 12161
In [38]:
# Make k-means clusterer
from sklearn.cluster import KMeans
clusterer = KMeans(7, random_state=1)
In [39]:
# We first will want to rescale our variables
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
mms.fit(mlb_df[['HR', 'R', 'RBI', 'SB', 'wRC+', 'BsR', 'Off', 'Def', 'WAR']])
data_transformed = mms.transform(mlb_df[['HR', 'R', 'RBI', 'SB', 'wRC+', 'BsR', 'Off', 'Def', 'WAR']])
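
K-means relies on Euclidean distance, so columns on larger scales would otherwise dominate the clustering. Here is a small sketch of what MinMaxScaler does, using HR and Def for the five players shown in mlb_df.head() as a toy input. Each column is mapped to [0, 1] via (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in: HR and Def for the first five players shown in mlb_df.head()
toy = np.array([[42, 1.8], [42, 3.7], [41, -3.9], [26, 9.0], [27, 6.6]])

scaled = MinMaxScaler().fit_transform(toy)

# Same result by hand, column by column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))
print(scaled.min(axis=0), scaled.max(axis=0))
```

Unlike StandardScaler, this keeps the shape of each column's distribution and only changes its range, so a single extreme value can compress the rest of the column toward 0 or 1.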
In [40]:
# Fit clusterer
clusterer.fit(data_transformed)
Out[40]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=7, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)
In [41]:
# Predict values
mlb_df['clust_grp'] = clusterer.predict(data_transformed)
In [42]:
mlb_df['clust_grp'].unique()
Out[42]:
array([6, 3, 4, 0, 1, 5, 2])
In [43]:
mlb_df = mlb_df[(mlb_df['WAR'] >= 0)]
In [44]:
sns.lmplot(data = mlb_df, x = 'Def', y = 'Off', hue = 'clust_grp', fit_reg = False)
Out[44]:
<seaborn.axisgrid.FacetGrid at 0x1a1a2ecf60>
In [45]:
# Let's recreate this scatterplot with plotly
import plotly.express as px
fig = px.scatter(mlb_df, x="Def", y="Off", color="clust_grp"
                 ,size='WAR'
                 ,hover_data=['Name'])
fig.show()
In [46]:
# We can also evaluate k-means clustering with SSE (stored as inertia_)
# Elbow method: fit k-means for a range of k and record the SSE for each
Sum_of_squared_distances = []
K = range(1,15)
for k in K:
    # Fix random_state so the curve is reproducible between runs
    km = KMeans(n_clusters=k, random_state=1)
    km = km.fit(data_transformed)
    Sum_of_squared_distances.append(km.inertia_)
    
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

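As we increase the number of clusters k, the SSE will continue to decrease (it reaches zero when every point is its own cluster). This doesn't mean that we should keep increasing the number of clusters, since eventually that would defeat the purpose. Instead, we want to choose k at the "elbow" of the curve: the point after which adding another cluster produces only a small further drop in SSE.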
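
The elbow is sometimes hard to read by eye, so a second measure can help. The sketch below pairs the SSE with the silhouette score from sklearn.metrics. The silhouette score is a swapped-in complement, not something the original analysis computed. The data is synthetic, with four well-separated blobs, so the "right" answer is known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: four tight, well-separated groups
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

sse, sil = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_                        # keeps shrinking as k grows
    sil[k] = silhouette_score(X, km.labels_)    # peaks at a well-fitting k

best_k = max(sil, key=sil.get)
print("k with the highest silhouette score:", best_k)
```

On real data such as the MLB features the peak is usually much less pronounced, so both curves are best treated as guides rather than as a rule.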

In [47]:
# Let's run t-SNE on the wine dataset, using the same feature columns as for LDA,
# to see if we get more defined clusters
from sklearn.manifold import TSNE
x_embedded = TSNE(n_components = 3
                  ,perplexity = 25
                  ,learning_rate = 1000
                  #,n_iter = 10000
                  #,n_iter_without_progress = ???
                 ).fit_transform(wine_df[wine_df.columns[0:11]])
In [48]:
pd.DataFrame(x_embedded).head()
Out[48]:
0 1 2
0 -54.326103 18.997721 60.703758
1 2.697969 -7.670809 -92.412666
2 -23.931200 50.034431 -60.926540
3 -13.229499 40.402512 -100.824371
4 -18.774120 103.733284 -39.702137
In [49]:
x_embedded_df = pd.DataFrame(x_embedded)
In [50]:
x_embedded_df = x_embedded_df.add_prefix('tsne')
x_embedded_df.head()
Out[50]:
tsne0 tsne1 tsne2
0 -54.326103 18.997721 60.703758
1 2.697969 -7.670809 -92.412666
2 -23.931200 50.034431 -60.926540
3 -13.229499 40.402512 -100.824371
4 -18.774120 103.733284 -39.702137
In [51]:
wine_df_tsne = pd.concat([wine_df, x_embedded_df], axis = 1)
In [52]:
wine_df_tsne.head()
Out[52]:
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality tsne0 tsne1 tsne2
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 -54.326103 18.997721 60.703758
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5 2.697969 -7.670809 -92.412666
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5 -23.931200 50.034431 -60.926540
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6 -13.229499 40.402512 -100.824371
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 -18.774120 103.733284 -39.702137
In [53]:
# Plot the first two t-SNE dimensions with plotly, colored by quality
import plotly.express as px
fig = px.scatter(wine_df_tsne, x="tsne0", y="tsne1", color="quality")
fig.show()
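
t-SNE works from pairwise distances (Euclidean by default), so features on larger scales can dominate the embedding. The sketch below shows one possible variation, not what was run above: standardize first and fix random_state so the layout is repeatable. It assumes scikit-learn's bundled wine data as a stand-in for wineQualityReds.csv.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data (178 rows, 13 features),
# NOT the wineQualityReds.csv file used in this notebook
X, y = load_wine(return_X_y=True)

# Put every feature on the same scale before computing distances
X_std = StandardScaler().fit_transform(X)

# perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=25, init="pca", random_state=0).fit_transform(X_std)

all_finite = bool(np.isfinite(emb).all())
print(emb.shape, all_finite)
```

Because t-SNE is stochastic, runs without a fixed random_state will produce different coordinates even on identical data, so the axes themselves carry no meaning. Only the relative grouping of points does.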

Uniform Manifold Approximation and Projection

UMAP documentation: https://umap-learn.readthedocs.io/en/latest/

In [54]:
import umap
reducer = umap.UMAP(n_neighbors = 5
                    ,min_dist = .75
                    ,n_components = 3
                    ,metric = 'euclidean'
                    )
In [56]:
wine_df[wine_df.columns[0:11]].head()
Out[56]:
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
In [168]:
wine_embedding = reducer.fit_transform(wine_df[wine_df.columns[0:11]])
/Users/matthewberezo/anaconda3/lib/python3.7/site-packages/umap/spectral.py:229: UserWarning:

Embedding a total of 4 separate connected components using meta-embedding (experimental)
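
The warning above is likely tied to the small n_neighbors = 5: with so few neighbors per point, the neighbor graph can split into pieces that have no links between them. The sketch below illustrates the idea with scikit-learn's exact k-nearest-neighbor graph on synthetic data. It is not UMAP's own (approximate) graph construction, and `n_graph_components` is a hypothetical helper written for this example.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)

# Synthetic stand-in: two tight groups of 50 points each, placed far apart
X = np.vstack([rng.normal(0, 1, size=(50, 3)),
               rng.normal(100, 1, size=(50, 3))])

def n_graph_components(X, k):
    """Count connected components of the (undirected) k-nearest-neighbor graph."""
    graph = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=False)
    n, _ = connected_components(graph, directed=False)
    return n

n_small = n_graph_components(X, 5)    # neighbors all come from the same group
n_large = n_graph_components(X, 60)   # more neighbors than group-mates forces cross links
print(n_small, n_large)
```

Raising n_neighbors tends to merge the pieces into one graph, at the cost of emphasizing global structure over fine local detail.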

In [169]:
wine_embedding.shape
Out[169]:
(1599, 3)
In [170]:
pd.DataFrame(wine_embedding).head()
Out[170]:
0 1 2
0 4.443419 -3.164578 -8.933680
1 10.088109 -2.256029 6.564030
2 4.491574 -7.221073 1.521892
3 5.545253 -6.240511 4.374491
4 4.447597 -3.174667 -8.904904
In [171]:
%matplotlib inline
sns.set(style='white', rc={'figure.figsize':(25,25)})
In [172]:
plt.scatter(wine_embedding[:,0], wine_embedding[:,1], c = wine_df["quality"])
Out[172]:
<matplotlib.collections.PathCollection at 0x1a21846da0>
In [173]:
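# Let's perform supervised UMAP by passing the quality labels as a target
# Note: the 'hamming' metric only checks whether values are exactly equal,
# so it is an unusual choice for continuous features like these
embedding = umap.UMAP(n_neighbors = 200
                      ,min_dist = 0.15
                      ,metric = 'hamming'
                      #,n_components = 5
                     ).fit_transform(wine_df[wine_df.columns[0:11]], y=wine_df["quality"])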
In [174]:
embedding.shape
Out[174]:
(1599, 2)
In [175]:
umap_embedding_df = pd.DataFrame(embedding)
In [176]:
umap_embedding_df = umap_embedding_df.add_prefix('umap')
umap_embedding_df.head()
Out[176]:
umap0 umap1
0 6.074284 2.078750
1 7.355929 1.139230
2 6.746639 1.153591
3 -2.874442 5.661441
4 6.040492 2.069470
In [177]:
wine_umap_df = pd.concat([wine_df, umap_embedding_df], axis = 1)
In [178]:
fig = px.scatter(wine_umap_df, x="umap0", y="umap1", color="quality")
fig.show()